dataset condensation
DC-BENCH: Dataset Condensation Benchmark
Dataset Condensation is a newly emerging technique that aims to learn a tiny dataset capturing the rich information encoded in the original dataset. As the datasets that contemporary machine learning models rely on grow increasingly large, condensation methods have become a prominent direction for accelerating network training and reducing data storage. Although numerous methods have been proposed in this rapidly growing field, evaluating and comparing different condensation methods is non-trivial and remains an open issue. The quality of a condensed dataset is often overshadowed by other factors that contribute critically to end performance, such as data augmentation and model architecture. The lack of a systematic way to evaluate and compare condensation methods not only hinders our understanding of existing techniques, but also discourages practical usage of the synthesized datasets. This work provides the first large-scale standardized benchmark on Dataset Condensation. It consists of a suite of evaluations that comprehensively reflect the generalizability and effectiveness of condensation methods through the lens of their generated datasets. Leveraging this benchmark, we conduct a large-scale study of current condensation methods and report many insightful findings that open up new possibilities for future development. The benchmark library, including evaluators, baseline methods, and generated datasets, is open-sourced to facilitate future research and application.
CondTSF: One-line Plugin of Dataset Condensation for Time Series Forecasting
The objective of dataset condensation is to ensure that the model trained with the synthetic dataset can perform comparably to the model trained with full datasets. However, existing methods predominantly concentrate on classification tasks, posing challenges in their adaptation to time series forecasting (TS-forecasting). This challenge arises from disparities in the evaluation of synthetic data. In classification, the synthetic data is considered well-distilled if the model trained with the full dataset and the model trained with the synthetic dataset yield identical labels for the same input, regardless of variations in output logits distribution. Conversely, in TS-forecasting, the effectiveness of synthetic data distillation is determined by the distance between predictions of the two models. The synthetic data is deemed well-distilled only when all data points within the predictions are similar.
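The contrast between the two evaluation criteria can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `label_agreement` and `forecast_distance` are hypothetical, the toy arrays are made up, and mean squared error stands in for the unspecified "distance between predictions".

```python
import numpy as np

def label_agreement(logits_full, logits_syn):
    """Fraction of inputs for which both models predict the same class.

    Only the argmax matters, so the logit distributions may differ freely.
    """
    same = np.argmax(logits_full, axis=1) == np.argmax(logits_syn, axis=1)
    return float(np.mean(same))

def forecast_distance(pred_full, pred_syn):
    """Mean squared error between two forecasts (an assumed choice of distance).

    Every predicted point contributes, so one deviating point raises the score.
    """
    return float(np.mean((pred_full - pred_syn) ** 2))

# Classification: the logits differ, but the predicted labels agree.
logits_full = np.array([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
logits_syn = np.array([[0.9, 0.8, 0.1], [0.0, 0.4, 0.3]])
agreement = label_agreement(logits_full, logits_syn)

# Forecasting: two of three points match, yet the distance is non-zero.
pred_full = np.array([[1.0, 2.0, 3.0]])
pred_syn = np.array([[1.0, 2.0, 4.0]])
distance = forecast_distance(pred_full, pred_syn)
```

In this toy case the classification criterion is fully satisfied even though the logits are far apart, while the forecasting criterion penalizes the single mismatched point.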
Elucidating the Design Space of Dataset Condensation
Dataset condensation, a concept within $\textit{data-centric learning}$, aims to efficiently transfer critical attributes from an original dataset to a synthetic version while maintaining both the diversity and the realism of the synthesized data. This approach can significantly improve model training efficiency and is adaptable to multiple application areas. Previous methods in dataset condensation have faced several challenges: some incur high computational costs that limit scalability to larger datasets ($\textit{e.g.,}$ MTT, DREAM, and TESLA), while others are restricted to less optimal design spaces, which could hinder potential improvements, especially on smaller datasets ($\textit{e.g.,}$ SRe$^2$L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design-centric framework that includes specific, effective strategies such as implementing soft category-aware matching, adjusting the learning rate schedule, and applying a small batch size. These strategies are grounded in both empirical evidence and theoretical backing.
Supplementary Materials for "Private Set Generation with Discriminative Information"
To compute the privacy cost of our approach, we numerically compute $D_\alpha(\mathcal{M}(D) \,\|\, \mathcal{M}(D'))$ in Definition A.1 for a range of orders $\alpha$ [9, 14] in each training step that requires access to the real gradient $g_\theta^{D}$. In comparison to normal non-private training, the major part of the additional memory and computation cost is introduced by the DP-SGD [1] step (for the per-sample gradient computation) that sanitizes the parameter gradient on real data, while the other steps (including the update on $S$ and the updates of $F(\cdot;\theta)$ on $S$) are equivalent to multiple calls of the normal non-private forward and backward passes (whose costs have lower magnitude than the DP-SGD step). GS-WGAN [3]: We adopt the default configuration provided by the official implementation ($\varepsilon = 10$): subsampling rate = 1/1000, DP noise scale $\sigma = 1.07$, batch size = 32. Following [3], we pretrain (warm-start) the model for 2K iterations, and subsequently train for 20K iterations. The experiments presented in Section 5.2 of the main paper correspond to the class-incremental learning setting [10], where the data partition at each stage contains data from disjoint subsets of label classes.
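The DP-SGD sanitization step mentioned above (per-sample gradients, then clipping and Gaussian noise) can be sketched generically as follows. This is a minimal illustration of the standard DP-SGD recipe, not the authors' implementation: the function name `dp_sgd_gradient`, the clipping bound `clip_norm = 1.0`, and the random toy gradients are assumptions, and the noise scale and batch size are borrowed from the GS-WGAN configuration purely as example values.

```python
import numpy as np

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_scale, rng):
    """Sanitize a batch of per-sample gradients in the usual DP-SGD manner.

    per_sample_grads: array of shape (batch_size, num_params).
    clip_norm: assumed L2 clipping bound for each per-sample gradient.
    noise_scale: Gaussian noise multiplier (sigma), relative to clip_norm.
    """
    batch_size, num_params = per_sample_grads.shape
    # Scale each per-sample gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * factors
    # Add Gaussian noise calibrated to the clipping bound, then average.
    noise = rng.normal(0.0, noise_scale * clip_norm, size=num_params)
    return (clipped.sum(axis=0) + noise) / batch_size

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10)) * 5.0  # toy per-sample gradients, batch size 32
sanitized = dp_sgd_gradient(grads, clip_norm=1.0, noise_scale=1.07, rng=rng)
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_scale=0.0, rng=rng)
```

The per-sample gradient array is what makes this step more expensive in memory and computation than an ordinary backward pass, which only needs the batch-averaged gradient.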